

Search for: All records

Creators/Authors contains: "Choi, Y"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying *only* the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall. 
    Free, publicly-accessible full text available April 14, 2026
  2. We report the observation of an electronic reconstruction in dimensionally controlled ruthenate heterostructures synthesized by pulsed laser deposition. The high structural and electronic quality of superlattices comprising a single SrRuO3 layer interspaced with varying thicknesses of insulating SrTiO3 layers was verified by reflection high-energy electron diffraction, atomic force microscopy, x-ray diffraction, reciprocal space mapping, and x-ray absorption spectroscopy. X-ray absorption spectroscopy evidences a confinement-driven evolution of the Ru electronic configuration from the d5L to the d4 state. Significant increases of the spin-orbit coupling are observed in connection with the configuration changes, supporting recent works identifying a large enhancement of the magnetic anisotropy. The growth of high-quality two-dimensional confined ruthenate layers under precisely controlled environments highlights the potential to directly manipulate interlayer coupling and selectively perturb the electronic state in ruthenates in analogy to superconducting Sr2RuO4.
  3. We present the photometric redshift characterization and calibration for the Dark Energy Camera All Data Everywhere (DECADE) weak lensing dataset: a catalog of 107 million galaxies observed by the Dark Energy Camera (DECam) in the northern Galactic cap. The redshifts are estimated from a combination of wide-field photometry, deep-field photometry with associated redshift estimates, and a transfer function between the wide field and deep field that is estimated using a source injection catalog. We construct four tomographic bins for the galaxy catalog and estimate the redshift distribution, $n(z)$, within each one using the Self-organizing Map Photo-Z (SOMPZ) methodology. Our estimates include the contributions from sample variance, zeropoint calibration uncertainties, and redshift biases, as quantified for the deep-field dataset. The total uncertainties on the mean redshifts are $\sigma_{\langle z \rangle} \sim 0.01$. The SOMPZ estimates are then compared to those from the clustering redshift method, obtained by cross-correlating our source galaxies with galaxies in spectroscopic surveys, and are shown to be consistent with each other.
    Free, publicly-accessible full text available October 22, 2026
  4. We present the pipeline for the cosmic shear analysis of the Dark Energy Camera All Data Everywhere (DECADE) weak lensing dataset: a catalog consisting of 107 million galaxies observed by the Dark Energy Camera (DECam) in the northern Galactic cap. The catalog derives from a large number of disparate observing programs and is therefore more inhomogeneous across the sky compared to existing lensing surveys. First, we use simulated data vectors to show the sensitivity of our constraints to different analysis choices in our inference pipeline, including sensitivity to residual systematics. Next, we use simulations to validate our covariance modeling for inhomogeneous datasets. Finally, we show that our choices in the end-to-end cosmic shear pipeline are robust against inhomogeneities in the survey, by extracting relative shifts in the cosmology constraints across different subsets of the footprint/catalog and showing they are all consistent within $1\sigma$ to $2\sigma$. This is done for forty-six subsets of the data and is carried out in a fully consistent manner: for each subset of the data, we re-derive the photometric redshift estimates, shear calibrations, survey transfer functions, the data vector, measurement covariance, and finally, the cosmological constraints. Our results show that existing analysis methods for weak lensing cosmology can be fairly resilient towards inhomogeneous datasets. This also motivates exploring a wider range of image data for pursuing such cosmological constraints.
    Free, publicly-accessible full text available October 22, 2026
  5. The metallicity distribution function (MDF) and internal chemical variations of a galaxy are fundamental to understanding its formation and assembly history. In this work, we analyze photometric metallicities for 3883 stars over 7 half-light radii (r_h) in the Sculptor (Scl) dwarf spheroidal (dSph) galaxy, using new narrowband imaging data from the Mapping the Ancient Galaxy in CaHK (MAGIC) survey conducted with the Dark Energy Camera (DECam) at the 4 m Blanco Telescope. This work demonstrates the scientific potential of MAGIC using the Scl dSph galaxy, one of the most well-studied satellites of the Milky Way. Our sample ranges from [Fe/H] ≈ −4.0 to [Fe/H] ≈ −0.6, includes six new extremely metal-poor candidates ([Fe/H] ≤ −3.0), and is almost 3 times larger than the largest spectroscopic metallicity data set in the Scl dSph. Our spatially unbiased sample of metallicities provides a more accurate representation of the MDF, revealing a more metal-rich peak than observed in the most recent spectroscopic sample. It also reveals a break in the metallicity gradient, with a strong change in the slope: from −3.26 ± 0.18 dex deg⁻¹ for stars inside ∼1 r_h to −0.55 ± 0.26 dex deg⁻¹ for the outer part of the Scl dSph. Our study demonstrates that combining photometric metallicity analysis with the wide field of view of DECam offers an efficient and unbiased approach for studying the stellar populations of dwarf galaxies in the Local Group.
    Free, publicly-accessible full text available October 24, 2026
  6. We present the Dark Energy Camera All Data Everywhere (DECADE) weak lensing dataset: a catalog of 107 million galaxies observed by the Dark Energy Camera (DECam) in the northern Galactic cap. This catalog was assembled from public DECam data including survey and standard observing programs. These data were consistently processed with the Dark Energy Survey Data Management pipeline as part of the DECADE campaign and serve as the basis of the DECam Local Volume Exploration survey (DELVE) Early Data Release 3 (EDR3). We apply the Metacalibration measurement algorithm to generate and calibrate galaxy shapes. After cuts, the resulting cosmology-ready galaxy shape catalog covers a region of 5,412 deg² with an effective number density of 4.59 arcmin⁻². The coadd images used to derive these data have median limiting magnitudes of r = 23.6, i = 23.2, and z = 22.6, estimated at S/N = 10 in a 2 arcsecond aperture. We present a suite of detailed studies to characterize the catalog, measure any residual systematic biases, and verify that the catalog is suitable for cosmology analyses. In parallel, we build an image simulation pipeline to characterize the remaining multiplicative shear bias in this catalog, which we measure to be m = (−2.454 ± 0.124) × 10⁻² for the full sample. Despite the significantly inhomogeneous nature of the dataset, due to it being an amalgamation of various observing programs, we find the resulting catalog has sufficient quality to yield competitive cosmological constraints.
    Free, publicly-accessible full text available October 22, 2026
  7. We present cosmological constraints from the Dark Energy Camera All Data Everywhere (DECADE) cosmic shear analysis. This work uses shape measurements for 107 million galaxies measured through Dark Energy Camera (DECam) imaging of 5,412 deg² of sky that is outside the Dark Energy Survey (DES) footprint. We derive constraints on cosmological parameters, including $S_8 = 0.791^{+0.027}_{-0.032}$, for the ΛCDM model, which are consistent with those from other weak lensing surveys and from the cosmic microwave background. We combine our results with cosmic shear results from DES Y3 at the likelihood level, since the two datasets span independent areas on the sky. The combined measurements, which cover 10,000 deg², prefer $S_8 = 0.791 \pm 0.023$ under the ΛCDM model. These results are the culmination of a series of rigorous studies that characterize and validate the DECADE dataset and the associated analysis methodologies (Anbajagane et al. 2025a,b,c). Overall, the DECADE project demonstrates that the cosmic shear analysis methods employed in Stage-III weak lensing surveys can provide robust cosmological constraints for fairly inhomogeneous datasets. This opens the possibility of using data that have been previously categorized as "unusable" for cosmic shear analyses, thereby increasing the statistical power of upcoming weak lensing surveys.
    Free, publicly-accessible full text available October 22, 2026
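
The pretokenization curriculum described in record 1 (learn subwords first, then superwords that bridge whitespace) can be illustrated with a toy sketch. This is not the SuperBPE authors' implementation: it works on characters rather than bytes, uses a tiny in-memory corpus, and every function name here (`pretokenize`, `apply_merge`, `learn_merges`, `train_superbpe`, `encode`) is hypothetical. The only idea taken from the abstract is the two-stage order: stage 1 forbids merges across word boundaries, and stage 2 lifts that restriction.

```python
import re
from collections import Counter


def pretokenize(text, split_on_whitespace):
    """Split text into chunks; merges never cross chunk boundaries."""
    if split_on_whitespace:
        # Each word keeps its leading whitespace character, e.g. " the".
        return re.findall(r"\s?\S+", text)
    # No pretokenization: the whole text is a single chunk.
    return [text]


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_merges(chunks, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent token pair."""
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for toks in chunks:
            counts.update(zip(toks, toks[1:]))
        if not counts:
            break  # nothing left to merge inside any chunk
        # Break frequency ties deterministically by comparing the pairs.
        pair = max(counts.items(), key=lambda kv: (kv[1], kv[0]))[0]
        merges.append(pair)
        chunks[:] = [apply_merge(toks, pair) for toks in chunks]
    return merges


def train_superbpe(text, n_subword, n_superword):
    """Two-stage curriculum: subword merges first, then superword merges."""
    # Stage 1: whitespace pretokenization, so merges stay inside words.
    chunks = [list(c) for c in pretokenize(text, True)]
    subword_merges = learn_merges(chunks, n_subword)

    # Stage 2: drop pretokenization, replay stage-1 merges, keep learning.
    chunks = [list(c) for c in pretokenize(text, False)]
    for pair in subword_merges:
        chunks = [apply_merge(toks, pair) for toks in chunks]
    superword_merges = learn_merges(chunks, n_superword)
    return subword_merges, superword_merges


def encode(text, merges):
    """Tokenize by replaying the learned merges in order."""
    tokens = list(text)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    corpus = "by the way by the way by the way"
    sub, sup = train_superbpe(corpus, n_subword=20, n_superword=2)
    print("subword-only:", encode(corpus, sub))
    print("with superwords:", encode(corpus, sub + sup))
```

Running the demo prints the same text tokenized with the stage-1 merges alone and with the stage-2 merges added. On this corpus the second encoding is shorter because some tokens now span several words, which is the encoding-efficiency effect the abstract reports at scale.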